Appendix 1: Back-imagination and Back-speech

Neural Information Processing Systems

Figure 1: Illustrative examples of the two proposed techniques, Back-imagination and Back-speech. Tiny ImageNet [Le and Yang, 2015] serves as a compact version of the comprehensive ImageNet dataset, and the Stanford Sentiment Treebank-2 (SST-2) [Socher et al., 2013] is a sentiment classification dataset. Given the scarcity of datasets for understanding natural language in visual scenes, we introduce a novel textual entailment dataset named Textual Natural Contextual Classification (TNCC), built on the foundation of Crisscrossed Captions [Parekh et al., 2020], an image-caption dataset. In this work, we employ a uniform experimental configuration for both the textual entailment and sentiment classification tasks. For the image classification task, we employ the ResNet18 [He et al., 2015] model, which is considered more suitable for small datasets.



Energy Consumption Analysis Details

Neural Information Processing Systems

The spike firing rate is defined as the proportion of non-zero elements in the spike tensor. In Table S1, we present the spike firing rates for all spiking tensors in Spike-driven Transformer-8-512. SNNs are theoretically more energy-efficient than their ANN counterparts. We employ two types of datasets: static image classification and neuromorphic classification. ImageNet-1K is the most representative static image dataset and is widely used in image classification.
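The firing-rate definition above (proportion of non-zero elements in a spike tensor) can be sketched directly; this is a minimal NumPy illustration of the stated definition, not the paper's actual measurement code, and the tensor shown is a made-up toy example:

```python
import numpy as np

def spike_firing_rate(spike_tensor: np.ndarray) -> float:
    """Proportion of non-zero elements in a spike tensor,
    matching the definition given in the text."""
    return float(np.count_nonzero(spike_tensor) / spike_tensor.size)

# Toy binary spike tensor: 3 of 8 elements fire.
spikes = np.array([[0, 1, 0, 0],
                   [1, 0, 0, 1]])
spike_firing_rate(spikes)  # -> 0.375
```

Lower firing rates mean fewer accumulate operations are triggered, which is the basis of the energy-efficiency argument for SNNs versus ANNs.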



HASSOD: Hierarchical Adaptive Self-Supervised Object Detection

Neural Information Processing Systems

Through extensive experiments on prevalent image datasets, we demonstrate the superiority of HASSOD over existing methods, thereby advancing the state of the art in self-supervised object detection. Notably, we improve Mask AR from 20.2 to 22.5 on LVIS, and from 17.0 to 26.0 on SA-1B.



Supplementary Material IEBins: Iterative Elastic Bins for Monocular Depth Estimation

Neural Information Processing Systems

Table 2 shows a performance trend similar to that on the NYU-Depth-v2 dataset as the number of bins increases. We report results on keyframes (selected by ORB-SLAM2) and on all frames of sequences 01-10. The ATE (m) metric is used.
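The ATE (absolute trajectory error, in metres) mentioned above is conventionally reported as the RMSE of positional error between aligned ground-truth and estimated trajectories. The sketch below assumes the trajectories are already aligned (e.g. via a similarity transform) and simply computes the RMSE; it is an illustration of the metric's usual definition, not the authors' evaluation code:

```python
import numpy as np

def ate_rmse(gt: np.ndarray, est: np.ndarray) -> float:
    """RMSE of per-frame positional error (metres) between
    aligned ground-truth and estimated positions, shape (N, 3)."""
    errors = np.linalg.norm(gt - est, axis=1)  # per-frame Euclidean error
    return float(np.sqrt(np.mean(errors ** 2)))

# Toy trajectories: a constant 1 m offset on every axis.
gt = np.zeros((2, 3))
est = np.ones((2, 3))
ate_rmse(gt, est)  # -> sqrt(3) ≈ 1.732
```

Reporting on both ORB-SLAM2 keyframes and all frames checks that accuracy holds beyond the frames the SLAM system itself selected as reliable.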